Abstract

We propose a novel problem formulation of 
learning a single task when the data are pro- 
vided in different feature spaces. Each such 
space is called an outlook, and is assumed 
to contain both labeled and unlabeled data. 
The objective is to take advantage of the data 
from all the outlooks to better classify each 
of the outlooks. We devise an algorithm that 
computes optimal affine mappings from different
outlooks to a target outlook by matching
moments of the empirical distributions.
We further derive a probabilistic interpreta- 
tion of the resulting algorithm and a sample 
complexity bound indicating how many sam- 
ples are needed to adequately find the map- 
ping. We report the results of extensive ex- 
periments on activity recognition tasks that 
show the value of the proposed approach in 
boosting performance. 



1. Introduction 

It is often the case that a learning task relates to 
multiple representations, to which we refer as out- 
looks. Samples belonging to different outlooks may 
have varying feature representations and distinct dis- 
tributions. Furthermore, the outlooks are not related 
through corresponding instances, but just by the com- 
mon task. 

Multiple outlooks arise in many real-life problems.
One example is activity recognition, where data
from different users, representing the outlooks, are
collected from different sensors. Each outlook
may have a totally different feature representation,
while the recognition task is common to all outlooks.
The ability to learn from these different representations
is formulated by multiple outlook learning.
Another example of multiple outlook learning is the
classification of document corpora written in different
languages. In this case, each language represents a
different outlook. In these situations, the transformations
between the outlooks are unknown and feature or
sample correspondence is not available. Consequently, it
is rather difficult to learn the task at hand while
exploiting the information in different representations.

The goal of multiple outlook learning is to use the 
information in all available outlooks to improve the 
learning performance of the task. We propose to ap- 
proach this learning problem in a two step procedure. 
First, we map the empirical distributions of the dif- 
ferent outlooks one to another. After the outlooks' 
distributions are matched, a generic classification al- 
gorithm can be applied using the available examples 
from all the outlooks. 

This approach allows us to map an outlook about which
we have little information to another about which we have
more information. That is, mapping the data to the 
same space effectively enlarges our sample size and 
may also give us a better representation of the prob- 
lem. We show that a classifier learned in the resulting 
space may outperform each single classifier. 

In general, matching multiple distributions, without 
feature alignment or assuming a parametric model, is 
a difficult task. Therefore, we propose to match the 
empirical moments of the distributions as an approxi- 
mation. We present an algorithm for finding one such 
mapping. The algorithm's objective is to find the op- 
timal affine transformations of the outlooks' spaces, 
while maintaining isometry within classes. From a geo- 
metric point of view, our algorithm is based on match- 
ing the centers and the main directions of the outlooks' 
sample distributions. One virtue of the algorithm is its 
simple closed form solution. 

2. Related work

Learning from multiple outlooks is related to other se- 
tups such as domain adaptation, multiple view learn- 
ing and manifold alignment. The main challenge in 
these setups, as in ours, is that the training and test 
data are drawn from different distributions. 

Domain adaptation tries to resolve a common scenario 
when some changes have been made to the test dis- 
tribution, while the labeling function of the domains 
remains more or less the same. Some authors portray
this situation by assuming a single hypothesis may
classify both domains well (Blitzer et al., 2007), while
others assume the target's posterior probability is
equal for the domains (Shimodaira, 2000; Huang et al.,
2007). The latter assumption is also referred to as the
covariate shift problem.

Algorithms for domain adaptation may be roughly
divided into three categories. One approach is to
re-weigh the training instances so they better resemble
the test distribution (Shimodaira, 2000; Huang et al.,
2007). Such algorithms are derived from the covariate
shift assumption, which is in some sense one of
the outlook mapping goals. A different approach
is to combine the classifiers learnt in each domain
(Mansour et al., 2009). Last, some works suggest
changing the feature representation of the domains.
This may be carried out by choosing a subset of features
(Satpal & Sarawagi, 2007), a combination of features
(Daumé III, 2007), or by finding some structural
correspondence between features in different domains
(Blitzer et al., 2006). All the described approaches
require an initial common feature representation for the
domains. Thus domain adaptation is a special case of
the multiple outlook problem, for the case of outlooks
with a common feature space. In Section 6 we show
that our approach can also be applied to this problem.

Multiple outlook learning is also closely related to the
multi-view setup (Rüping & Scheffer, 2005). In this
setup, each view contains the same set of samples
represented by different features. Clearly, any multiple
view data set is also an instance of a multiple outlook
data set with the added requirement that each sample
has observations from multiple outlooks. One common
approach is to map a pattern matrix of each view
to a consensus pattern by matching corresponding
instances (Long et al., 2008; Hou et al., 2010). Note
that in the multiple outlook framework each outlook
contains a unique set of samples, thus sample-to-sample
correspondence is impossible. Amini et al. (2009)
consider the case when correspondence is missing for
some instances, but assume the existence of
mapping functions between the views.



Multi-view learning is sometimes referred to as man- 
ifold alignment. In manifold alignment we look for 
a transformation of two data sets with sample pair- 
wise correspondence that minimizes the distance be- 
tween them, in an unsupervised (Wang & Mahadevan, 
2008) or a semi-supervised (Ham et al., 2005) manner. 
Wang & Mahadevan (2009) present manifold alignment
without pairwise correspondence. To our knowledge,
this is the only work on manifold alignment that
does not assume a pairwise matching of the samples.
Unlike our algorithm, the algorithm presented in that
work was not originally designed for classification.

3. Mapping Two Outlooks 

3.1. Problem Setting 

The learner is given two outlooks belonging to separate
input spaces $\mathcal{X}^{(1)}$ and $\mathcal{X}^{(2)}$ of dimension $d_1$ and
$d_2$ respectively, with a common target $\mathcal{Y} = \{1,\dots,c\}$.
We assume that all example pairs of a given outlook
$j = 1,2$ are independently drawn from an unknown
distribution $\mathcal{D}_j$, which is unique to each outlook. Denote
by $X_i^{(1)}$ and $X_i^{(2)}$ the data matrices of class $i$ of
outlook 1 and 2, respectively. We use superscripts to
denote the outlooks' index, and subscripts to denote
the classification class.

3.2. Multiple Outlook MAPping algorithm 

In this section we present our main Multiple Outlook 
MAPping algorithm (MOMAP) for matching the rep- 
resentations of two outlooks. Throughout the deriva- 
tions outlook 2 is mapped to outlook 1, which is some- 
times referred to as the final outlook. Our goal is to 
map an outlook where we have ample labeled data, to 
an outlook where little labeled information is available. 

As a preliminary step to the mapping algorithm, scaling
is applied to each of the outlooks separately, with the
aim of normalizing the features of
all outlooks to the same range. Note that this stage 
may be done using unlabeled data when available. 

Next, we use the labeled samples to match the two 
outlooks. The goal of this stage is to map the scaled 
representations by rotation and translation. Specif- 
ically, the mapping is performed by translating the 
means of each class to zero, rotating the classes to fit 
each other well, and then translating the means of the 
mapped outlook to the final outlook. 

Let $\{\hat{\mu}_i^{(1)}, \hat{\mu}_i^{(2)}\}_{i=1}^{c}$ be the set of empirical class means of
the outlooks. We translate the empirical means of each
class of both outlooks to zero:

$$X_i^{(j)} \leftarrow X_i^{(j)} - \hat{\mu}_i^{(j)}, \qquad i = 1,\dots,c, \; j = 1,2. \tag{1}$$



Next, we turn to matching the main directions of the 
classes by rotation. Note that a rotation matrix may 
be defined in many manners. We search for mappings 
in the set of all orthonormal matrices (rotation and 
reflection). Our choice of mapping by rotation is mo- 
tivated by its isometry property, which allows us to 
maintain the relative distance between the samples. 
We construct utilization matrices for each of the outlooks
as follows. Define $D_i^{(j)}$ as the utilization matrix
of outlook $j$ and class $i$. $D_i^{(1)}$ and $D_i^{(2)}$ are concatenated
matrices constructed from the $h < \min(d_1, d_2)$
principal directions of the corresponding outlook and
class, that is, the $h$ eigenvectors of the empirical
covariance matrices $\Sigma_i^{(1)}, \Sigma_i^{(2)}$ corresponding to the $h$
largest eigenvalues.
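
As an illustration, the construction of a utilization matrix can be sketched in NumPy as follows. The function name `principal_directions`, the row-per-sample layout of `X`, and the use of `numpy.linalg.eigh` are assumptions made for this sketch and are not details taken from the paper.

```python
import numpy as np

def principal_directions(X, h):
    """Return a (d, h) matrix whose columns are the h leading eigenvectors
    of the empirical covariance of X (n samples as rows, d features)."""
    cov = np.cov(X, rowvar=False)           # (d, d) empirical covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:h]   # indices of the h largest
    return eigvecs[:, order]

# Hypothetical usage: one class of one outlook, d = 3 features, h = 2.
rng = np.random.default_rng(0)
X_class = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])
D = principal_directions(X_class, h=2)
```

Note that eigenvectors are determined only up to sign, so two runs on different outlooks may return directions that differ by a reflection.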

Using the utilization matrices we find the rotation 
matching the outlooks by solving the following opti- 
mization problem: 



$$\{R_i\}_{i=1}^{c} = \arg\min_{\{R_i\}} \sum_{i=1}^{c} \left\| R_i D_i^{(2)} - D_i^{(1)} \right\|_F^2 \tag{2}$$
$$\text{subject to: } R_i^T R_i = I, \quad i = 1,\dots,c,$$

where $\|\cdot\|_F$ is the Frobenius norm.


To gain some intuition on Problem (2) we disassemble
a term in the sum of the objective function:

$$\left\| R_i D_i^{(2)} - D_i^{(1)} \right\|_F^2 = \sum_{l=1}^{h} \left\| R_i v_{i,l}^{(2)} - v_{i,l}^{(1)} \right\|_2^2 = 2h - 2 \sum_{l=1}^{h} \left( v_{i,l}^{(1)} \right)^T R_i v_{i,l}^{(2)},$$

where $v_{i,l}^{(j)}$ $(l = 1,\dots,h)$ are the principal directions of
the $i$-th class of outlook $j$, each of unit norm. We obtain that Problem
(2) is equivalent to maximization of the sum of inner
products between the principal directions of outlook 1
and the rotated principal directions of outlook 2, which
in turn implies minimization of the first $h$ principal
angles between the classes of both outlooks.

Although Problem (2) is not convex it can be solved in
closed form. For the solutions constructed in this stage
we borrow techniques from the literature of Procrustes
Analysis (Gower & Dijksterhuis, 2004). Problem (2) is
equivalent to

$$\arg\max_{\{R_i\}} \sum_{i=1}^{c} \mathrm{tr}\left( R_i D_i^{(2)} D_i^{(1)T} \right) \tag{3}$$
$$\text{subject to: } R_i^T R_i = I, \quad i = 1,\dots,c.$$



Problem (3) is separable, thus each component in the 
sum may be optimized separately. In the following 
derivations we drop the subscript i for brevity. 



Algorithm 1 Matching two outlooks

Input: empirical moments $\hat{\mu}_i^{(j)}$ $\forall i, j$.
for $i = 1$ to $c$ do
  Translate the class means to zero: $X_i^{(j)} = X_i^{(j)} - \hat{\mu}_i^{(j)}$, $j = 1, 2$.
  $X_i^{(2)}$ = MatchByRotation($X_i^{(1)}$, $X_i^{(2)}$).
  Translate to the means of the final outlook: $X_i^{(2)} = X_i^{(2)} + \hat{\mu}_i^{(1)}$.
end for
Output: $X^{(2)}$.

Algorithm 2 MatchByRotation

Input: matrices $X^{(1)}, X^{(2)}$.
Construct matrices $D^{(1)}, D^{(2)}$.
Compute SVD factorization $D^{(2)} D^{(1)T} = U S V^T$.
$R = V U^T$.
Output: $X^{(2)} = X^{(2)} R^T$.



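Let $U S V^T$ be the singular value decomposition (SVD)
of $D^{(2)} D^{(1)T}$. Define $Z = V^T R U$. Then,

$$\mathrm{tr}\left( R D^{(2)} D^{(1)T} \right) = \mathrm{tr}\left( R U S V^T \right) = \mathrm{tr}(Z S) = \sum_{k} z_{kk} \sigma_k \le \sum_{k} \sigma_k,$$

where $\sigma_k$ is the $k$-th singular value of $D^{(2)} D^{(1)T}$. The
inequality holds because $Z$ is orthonormal, so $z_{kk} \le 1$. The
upper bound is attained for $R = V U^T$ since in that
case $Z = I$ (Algorithm 2).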

After the rotation, we translate the classes to match 
the original means of the final outlook. The above 
derivation gives rise to an algorithm that matches two 
given outlooks. The algorithm is described in Algo- 
rithm 1. 
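
The three steps above (centering, rotation, translation to the final outlook) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: samples are stored as rows, both outlooks have the same dimension (see Remark 1 otherwise), every class appears in both outlooks, and the helper names `top_directions`, `match_by_rotation` and `map_outlook` are hypothetical.

```python
import numpy as np

def top_directions(X, h):
    """Columns are the h leading eigenvectors of the empirical covariance."""
    w, V = np.linalg.eigh(np.cov(X, rowvar=False))
    return V[:, np.argsort(w)[::-1][:h]]

def match_by_rotation(X1, X2, h):
    """Sketch of Algorithm 2: rotate centered class data X2 toward X1."""
    D1, D2 = top_directions(X1, h), top_directions(X2, h)
    U, _, Vt = np.linalg.svd(D2 @ D1.T)   # D2 D1^T = U S V^T
    R = Vt.T @ U.T                        # R = V U^T
    return X2 @ R.T, R

def map_outlook(X1, y1, X2, y2, h):
    """Sketch of Algorithm 1: map outlook 2 onto outlook 1 class by class."""
    X2_mapped = np.empty_like(X2, dtype=float)
    for c in np.unique(y2):
        A1, A2 = X1[y1 == c], X2[y2 == c]
        mu1, mu2 = A1.mean(axis=0), A2.mean(axis=0)
        rotated, _ = match_by_rotation(A1 - mu1, A2 - mu2, h)
        X2_mapped[y2 == c] = rotated + mu1   # translate to final outlook means
    return X2_mapped
```

Because each `R` is orthonormal, pairwise distances within a class are preserved, and after the final translation the class means of the mapped outlook coincide with those of the final outlook.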

Remark 1. The outlooks need not have the same dimension.
In this case, the orthonormality constraint cannot
be satisfied as $R$ is no longer a square matrix.
However, this problem can be easily solved. Suppose
that $D_i^{(1)}$ and $D_i^{(2)}$ have different numbers of rows.
Then, simply add rows of zeros to the smaller dimensional
configuration until the dimensions are equalized.
In this manner, we embed the smaller configuration in
the space of the larger one.
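
The zero-padding of Remark 1 can be sketched as follows. The helper name `pad_rows` is hypothetical, and the dimensions (37 and 35 features, $h = 3$) are chosen only for illustration; the matrices are random orthonormal stand-ins for real utilization matrices.

```python
import numpy as np

def pad_rows(D, d):
    """Append rows of zeros so that D has d rows (embeds D in the larger space)."""
    extra = d - D.shape[0]
    if extra < 0:
        raise ValueError("target dimension is smaller than D")
    return np.vstack([D, np.zeros((extra, D.shape[1]))])

# Illustrative stand-ins for utilization matrices of two outlooks.
rng = np.random.default_rng(1)
D1, _ = np.linalg.qr(rng.normal(size=(37, 3)))  # orthonormal columns
D2, _ = np.linalg.qr(rng.normal(size=(35, 3)))
D2_padded = pad_rows(D2, D1.shape[0])

U, _, Vt = np.linalg.svd(D2_padded @ D1.T)
R = Vt.T @ U.T                                   # square 37 x 37 rotation
```

After padding, the rotation is square again, so the orthonormality constraint can be met.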

Remark 2. Algorithm 1 does not rely on any corresponding
instances in both outlooks. However, when
available, such instances may aid the mapping accu- 
racy and can be easily incorporated into the algorithm. 
It is possible to do so by adding columns of the corre- 
sponding instances to the utilization matrices. 

4. Extension to Multiple Outlooks

We present an extension of Algorithm 1 to the case 
of multiple outlooks. The multiple outlook scenario 
allows us to use the information available in all the 
outlooks to allow better learning of each one. To do 
so, we transform all the outlooks one to another. As 
for two outlooks, we begin by translating the means of 
each class of all the outlooks to zero. In the rotation 
step, the optimal rotations are found by solving 






$$\text{subject to: } R_i^{(j)T} R_i^{(j)} = I \quad \forall i, j. \tag{4}$$



Observe that Algorithm 2 produces an optimal solution
with zero error, as there is always a perfect rotation
between two sets of $h$ orthogonal vectors. Therefore,
one optimal solution of (4), which attains an objective
value of zero, is to rotate all outlooks to a chosen
final outlook. Namely, for $m$ outlooks, $m - 1$ rotation
matrices are computed for each class. Finally, we
shift the means of the rotated outlooks to those of the
final outlook.

If we want to switch the choice of final outlook, all 
we need to do is apply the inverse mapping of the 
relevant outlook to all mapped outlooks. For example, 
to switch from outlook s to k one needs to apply the 
following transformation: 



X 



(A) 
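
One way to write out the inverse mapping described above is sketched here. The notation is an assumption of this sketch and need not match the original equation: samples are stored as rows, $R_i^{(k)}$ denotes the rotation produced by Algorithm 2 when class $i$ of outlook $k$ was mapped to the final outlook $s$, and $\hat{\mu}_i^{(k)}, \hat{\mu}_i^{(s)}$ are the corresponding empirical class means. The forward map of outlook $k$ is then $X \mapsto (X - \hat{\mu}_i^{(k)}) R_i^{(k)T} + \hat{\mu}_i^{(s)}$, and its inverse, applied to every outlook $j$ already mapped to $s$, reads

$$X_i^{(j)} \mapsto \left( X_i^{(j)} - \hat{\mu}_i^{(s)} \right) R_i^{(k)} + \hat{\mu}_i^{(k)}, \qquad i = 1,\dots,c.$$

Substituting the forward map into this expression and using $R_i^{(k)T} R_i^{(k)} = I$ recovers the original $X$, which confirms that the two maps are inverses.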

5. Analysis



In this section we give a probabilistic robust interpretation
of the rotation process, and prove a sample
complexity bound on the convergence of the estimated
rotation matrix.

5.1. Probabilistic Interpretation 

In this section we discuss the effect of adding random
noise to the utilization matrices on the optimal rotation
between two outlooks (Problem (2)). We do not assume
knowledge of the probability distribution of the
noise. Instead, we use its bounded total value for some
chosen confidence level. We show that the solution to
the noised problem is bounded by the sum of the solution
to the original problem and a constant value that
depends on the noise. Notably, the noise only has an
additive effect on the bound.

Let $\Delta$ be the additive random uncertainty to the
utilization matrix $D_i^{(2)}$ for some class $i$. Suppose that
this uncertainty follows an unknown joint distribution
$\Delta \sim \mathcal{P}$. This uncertainty may be portrayed by a
chance-constrained extension of Problem (2)¹:

$$\min_{R^T R = I,\; \tau} \; \tau \quad \text{subject to: } P_{\Delta \sim \mathcal{P}}\left( \left\| R (D^{(2)} + \Delta) - D^{(1)} \right\|_F \le \tau \right) \ge 1 - \eta, \tag{5}$$

where $\eta \in [0, 1]$ is the desired confidence level.

Optimization of the chance constrained problem is 
natural, as it obtains, with high probability, the 
optimal rotation. However, despite their intuitive 
probabilistic form, chance constrained problems are 
generally intractable (Shapiro et al., 2009), thus 
we approximate Problem (5) as follows. We define 
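$\rho^* = \inf_{\alpha} \{ \alpha : P_{\Delta \sim \mathcal{P}}( \|\Delta\|_F \le \alpha ) \ge 1 - \eta \}$ and obtain that
with probability at least $1 - \eta$

$$\left\| R (D^{(2)} + \Delta) - D^{(1)} \right\|_F \le \max_{\|\Delta\|_F \le \rho^*} \left\| R (D^{(2)} + \Delta) - D^{(1)} \right\|_F.$$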



Therefore, Problem (5) is upper bounded by the fol- 
lowing minmax problem 



$$\min_{R^T R = I} \; \max_{\|\Delta\|_F \le \rho^*} \left\| R (D^{(2)} + \Delta) - D^{(1)} \right\|_F. \tag{6}$$

This is the robust version of the original rotation problem,
with the uncertainty set $\mathcal{U} = \{ \Delta \mid \|\Delta\|_F \le \rho^* \}$².
Next, we construct the robust counterpart of (6).

Theorem 1. Problem (6) is equivalent to

$$\min_{R^T R = I} \left\| R D^{(2)} - D^{(1)} \right\|_F + \rho^*.$$



The proof is provided in A.1. The theorem shows
that Problem (2) is robust to a perturbation of a total
bounded value. That is, for a bounded noise, the only
difference between the solution to the original problem
and its robust version (Problem (6)) is an additive
constant $\rho^*$. From a probabilistic point of view, the
solution of this problem also provides a bound on the
chance constrained problem in (5).
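
A short sketch of why the constant is additive, given as an illustrative argument that is not necessarily the proof in A.1. It assumes that $R$ is square and orthonormal, so that $\|R\Delta\|_F = \|\Delta\|_F$ and $R R^T = I$, and it introduces the shorthand $E = R D^{(2)} - D^{(1)}$, which is not notation from the paper.

$$
\begin{aligned}
\left\| R (D^{(2)} + \Delta) - D^{(1)} \right\|_F &= \| E + R\Delta \|_F \le \|E\|_F + \|\Delta\|_F \le \|E\|_F + \rho^*, \\
\Delta^\star = \rho^* \frac{R^T E}{\|E\|_F} \;&\Rightarrow\; \| E + R\Delta^\star \|_F = \left( 1 + \frac{\rho^*}{\|E\|_F} \right) \|E\|_F = \|E\|_F + \rho^*.
\end{aligned}
$$

The first line is the triangle inequality; the second shows the bound is attained when $E \neq 0$ (for $E = 0$ any $\Delta$ with $\|\Delta\|_F = \rho^*$ attains it). Hence the inner maximum equals $\|E\|_F + \rho^*$ for every feasible $R$, and minimizing over $R$ selects the same rotation as the unperturbed problem.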

5.2. Sample complexity bounds 

We next provide a bound for the sample complexity of
the rotation step of the algorithm. 



¹ Since Problem (2) is separable, the extension is done to
each class separately. We drop the subscript $i$, representing
the class, from the following derivations for brevity.

² The original rotation problem was actually the square
of the Frobenius error. However, the two problems are
equivalent since taking the square does not change the
solution.

Assumption 1. (Gaussian Mixture) Each outlook is
generated by a unique mixture of $c$ Gaussian distributions,
where $c$ is the number of classes. The samples
of each outlook are realizations of $x \sim \sum_{i=1}^{c} w_i f_i(x)$,
where $f_i(x) \sim \mathcal{N}(\mu_i, \Sigma_i)$ and $\sum_{i=1}^{c} w_i = 1$. We further
assume that $\| \mathbb{E}\, x x^T \| \le 1$ for each component.

Theorem 2. Suppose that Assumption 1 holds. For
each outlook, let $\delta, \epsilon_i, \epsilon \in (0,1)$ $(i = 1,\dots,c)$ and suppose
that the number of samples for each class $i$ satisfies:

n,>C^log 2 ^W« 



Then 






R-R 



log' 



<e > 1 



\ S J 



where $\hat{R}$ is the estimated rotation matrix found by Algorithm 2,
$d$ is the dimension and $C$ is a constant.

The proof of the theorem is provided in A.2. Note that
the sample complexity of the mapping algorithm is 
dominated by the rotation stage. In practice, the num- 
ber of chosen principal directions h is usually small. 
Also note that the bound on the norm of the second 
moment in Assumption 1 is achieved by the scaling 
stage. 

6. Experiments 

In this section we demonstrate our framework on activity
recognition data, in which different users represent
different outlooks. In this application, the multiple 
outlooks setup allows for valuable flexibility in real life 
recordings. For example, some users may use a simple 
sensor configuration for recordings, while others use a 
complex sensor board of multiple sensors. Also, this 
setup may resolve problems of varying sampling rates 
when using different hardware and workloads. 

In our experiments we test two setups: a domain adap- 
tation setup and a multiple outlook setup. For the 
domain adaptation setup a common feature represen- 
tation is used, while for the multiple outlook setup a 
unique feature space is used for each user. 

6.1. Data set description and feature 
extraction 

The data set used for the experiments was collected 
by Subramanya et al. (2006) using a customized wear- 
able sensor system. The system includes a 3-axis 
accelerometer, phototransistors for measuring light,
barometric pressure sensors, and GPS data. The data 
consist of recordings from 6 participants who were 
asked to perform a variety of activities and record the 
labels. We used the following labels: walking, run- 
ning, going upstairs, going downstairs and lingering. 



After removing data with obvious annotation errors
the data consist of about 50 hours of recording, divided
approximately evenly among the 6 users. For
each user the activities are roughly divided into 40%
walking, 40-50% lingering, 2-5% running, 2-3%
going upstairs, and 2-3% going downstairs. See
(Subramanya et al., 2006) for further details on the 
sensor system and the recordings. 

From the raw data we extracted windowed samples 
as follows. From the accelerometer data we used the 
x-axis measurements sampled at 512Hz, which we decimated
to 32Hz. The barometric pressure, sampled at
7.1Hz, was smoothed and interpolated to 32Hz. Next,
we applied a two-second sliding window over each 
signal using a window of appropriate length. From 
each window a feature vector is extracted containing 
the Fourier coefficients of the accelerometer data, the 
mean of the gradient of the barometric pressure, and 
the mean values of the light signals. All together we 
obtained 20-35 thousand samples for each user with 37 
features. 

As explained in Section 3.2, before mapping the out- 
looks scaling should be applied to all the outlooks. For 
all the experiments, we scale the data to [0,1]. To reduce
the sensitivity of the scaling to outliers we first
collapse the extreme two percentiles of the data to the
value of the extreme remaining values (also known as
Winsorization). Scaling parameters are chosen on the
training data and applied to the test data. This pre- 
processing was applied to all baseline classifiers. 
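
The preprocessing described above can be sketched as follows. The function names `fit_scaler` and `apply_scaler` are hypothetical, and clipping `pct = 2` percent on each tail is one reading of "the extreme two percentile"; the text does not say whether the two percent is per tail or in total.

```python
import numpy as np

def fit_scaler(X_train, pct=2.0):
    """Per-feature clipping bounds at the pct and (100 - pct) percentiles,
    estimated on the training data only."""
    lo = np.percentile(X_train, pct, axis=0)
    hi = np.percentile(X_train, 100.0 - pct, axis=0)
    return lo, hi

def apply_scaler(X, lo, hi):
    """Winsorize X to [lo, hi] per feature, then rescale to [0, 1]."""
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (np.clip(X, lo, hi) - lo) / span
```

Fitting the bounds on the training split and reusing them on the test split mirrors the statement that scaling parameters are chosen on the training data and applied to the test data.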

6.2. Domain Adaptation Setup 

As mentioned above, multiple outlook learning may 
also be applied for domain adaptation. We tested both 
standard domain adaptation of two domains, as well 
as multiple source domain adaptation. 

For the two domain problem we adopted the commonly 
used terminology in domain adaptation of source and 
target domains. We applied Algorithm 1 for different 
fractions of target labeled data and fully labeled source 
data. The performance was computed by 10-fold cross- 
validation, each fold containing random samples from 
each class according to its fraction in the complete set. 
The only parameter of the algorithm h was chosen on 
a random split. 

We test the success of the mapping algorithm by classi- 
fication of the target test data with a classifier trained 
on the mapped source data, denoted as the MOMAP 
classifier (no target data was used for training). This 
is a multi-class classification problem, with five possible
labels. We use a multi-class SVM classifier with an
RBF kernel ($C = 64$, $\gamma = 0.25$)³ obtained by the LIBSVM
software (Chang & Lin, 2001). The data are unevenly
distributed among the five classes, therefore we use the
balanced error rate (BER) as a performance measure:
$\mathrm{BER} = \frac{1}{c} \sum_{i=1}^{c} \frac{e_i}{n_i}$, where $e_i$ and $n_i$ are the number
of errors and number of samples in class $i$ respectively,
and $c$ is the number of classes.
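
For concreteness, the balanced error rate can be computed as in the following sketch. The function name `balanced_error_rate` and the toy labels are illustrative assumptions.

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Mean of the per-class error rates e_i / n_i over the classes in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rates = [np.mean(y_pred[y_true == c] != c) for c in np.unique(y_true)]
    return float(np.mean(rates))

# Toy labels: class 0 has 1 error in 4 samples, class 1 has 1 error in 2.
ber = balanced_error_rate([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])
```

Here the per-class error rates are 0.25 and 0.5, so the BER is 0.375, whereas the plain error rate would be 2/6; the rarer class weighs as much as the frequent one.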

We compare the MOMAP classifier to the following 
baselines: a target only classifier, trained on the avail- 
able labeled target data (TRG); a source only clas- 
sifier, trained on the source data (SRC); a classifier 
trained on all available labeled data of target and 
source (ALL); and the domain adaptation algorithm 
presented in (Daumé III, 2007) (FEDA). We also add
the "optimal" error, obtained by training on the fully 
labeled target data (OPT). 

The results are presented in Figure 1. It can be ob- 
served that the MOMAP classifier outperforms the 
baseline classifiers for most fractions of target labeled 
data. The algorithm performs well across all sets of
users. For example, for 5% labeled data it is significantly
better (p-value < 0.05) than the TRG, SRC and
FEDA classifiers for all sets, and significantly better
than the ALL classifier for 18 out of 30 possible sets
(see Table 1 in A.3).

In the next experiment we consider mixtures of m 
source domains with some labeled data (both train- 
ing and test sets are mixtures). We use the extension 
to multiple outlooks presented in Section 4 to find the 
mappings of the sources to each outlook. We test the 
classification performance on each component of the 
mixture with a classifier trained on all the mapped 
sources. The final performance measure is the mean 
BER averaged on all the sources. As in the previ- 
ous experiment, the evaluation was done by 10-fold 
cross-validation, with the same classifier. The base- 
lines are similar, with the change of the TRG to the 
mean value of multiple classifiers trained in each do- 
main, and the ALL baseline to a classifier trained on 
all sources (the SRC classifier was not relevant). The 
experiment was performed on all 20 triplet combinations.
Sample results are presented in Figure 2. These
trends were consistent across users. For example, for
15% of labeled data the MOMAP algorithm outperforms
all other classifiers for 15 of the combinations
(p-value < 0.05). In the 5 remaining combinations,
the algorithm performed significantly better than the
TRG and FEDA algorithms, and equally well as the
ALL classifier (see Table 2 in A.3). For larger portions
of labeled data the MOMAP algorithm also obtained
smaller error than the ALL classifier (p-value < 0.05).
The effect of the ALL classifier may be a result of
some regularization obtained from training on data
from similar yet different domains.

Figure 1. Domain adaptation setup for 2 domains.

Figure 2. Domain adaptation setup for multiple outlooks:
users 1, 2 and 5.

³ The parameters were chosen on the target classification
problem. Common parameters were chosen for clear
performance comparison of the different classifiers.

6.3. Multiple Outlook Setup 

We conducted three types of experiments for the mul- 
tiple outlook setup, each with a different feature repre- 
sentation. The experiments' setup was similar to the 
previous experiments with some adjustments to the 
baselines: the SRC, ALL and FEDA baselines were no 
longer relevant, as the outlooks' features differ. 

In the first experiment we tested the multiple outlook
algorithm on two outlooks for the case of different sensors
and added noise features. For the mapped outlook
we used the full feature representation (37 features).
For the target outlook we used the accelerometer
and pressure features, and excluded the light measurements.
Instead of the light features we added features
with Gaussian random noise ($\mathcal{N}(0, 1)$). The experiment
was performed on all pair combinations. For 5%
labeled data of the learned outlook, the mean BER
of the MOMAP was 4.5% (±2.7%) lower than that of
the TRG classifier. The results for four user pairs are
presented in Figure 3. These results show that the
mapping was successful, as training on the mapped
data outperforms training on partial data in the target
outlook. In Fig. 3(c) the MOMAP algorithm has
lower error than the OPT classifier for some fractions;
this may be a result of the added information in the
light features.

Figure 3. Two outlooks with different sensors. Final outlook:
accelerometer and pressure. Mapped outlook: accelerometer,
pressure and light sensors. The missing features
in the final outlook are replaced by noise.

In the second experiment we tried to learn from two 
outlooks with a different number of features result- 
ing from different sampling rates. Specifically, for the 
learned outlook we kept the full feature representation 
as described in Section 6.1, while for the mapped out- 
look we used the same type of features but with 30Hz 
sampling rate instead of 32Hz. This resulted in 37 
features in the target outlook and 35 in the mapped 
one. Note that our algorithm may be easily modified 
for this scenario; see Remark 1 in Section 3.2. For
5% labeled data the MOMAP algorithm had on average
5.9% (±2.4%) lower BER than the TRG classifier.
Figure 4 presents the results on four user pairs. In
Figs. 4(a) and 4(c) the MOMAP algorithm has lower
error than the OPT classifier. Observe that this is possible
since the balanced error rate is presented, which
treats the error in different classes equally (namely,
the MOMAP classifier does not outperform the OPT
classifier in terms of the non-balanced error).

Figure 4. Multiple outlook learning for two outlooks with
different sampling rates.

In the third experiment we constructed the feature representation
of each outlook from the 33 accelerometer
features, to which we added 10 features of Gaussian
noise ($\mathcal{N}(0, 1)$). We then randomly permuted the order
of the features of each outlook. For this experiment,
we used samples belonging to the walking, running
and lingering classes, as we did not use the full feature
set. The experiment was performed for two outlooks as
well as for multiple outlooks. The results indicate a
performance boost from MOMAP, especially for the
running activity. Due to space limitations we provide
the results in A.4.

7. Future Work 

Our proposed approach is a first step in developing 
the methodology for learning from multiple outlooks. 
This approach may be extended to many interesting 
directions. First, in this paper we only considered
affine mappings between the outlooks and a natural
extension is to consider richer classes of transformations
such as piecewise linear mappings. Also, our approach
is batch in the sense that first all the data have
to be processed and then the classification algorithm 
can be used. A different extension of practical interest 
would be to develop an online version of the proposed 
approach that takes samples one by one and gradually 
improves the mapping. Finally, a major application 
domain, of independent interest, is natural language 
processing. Here the challenge would be to use a lan- 
guage where labels are abundant to better classify in a 
different language. The main obstacle here seems to be 
the nature of representation: language data are often 
represented as sparse vectors which may call for a dif- 
ferent type of transformations between the outlooks. 

References 

Amini, M., Usunier, N., and Goutte, C. Learning from 
Multiple Partially Observed Views-an Application 
to Multilingual Text Categorization. In Advances 
in Neural Information Processing Systems, 2009. 

Blitzer, J., McDonald, R., and Pereira, F. Domain 
adaptation with structural correspondence learning. 
In Proceedings of the 2006 Conference on Empir- 
ical Methods in Natural Language Processing, pp. 
120-128. Association for Computational Linguistics, 
2006. ISBN 1932432736. 

Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and
Wortman, J. Learning bounds for domain adaptation.
In Advances in Neural Information Processing
Systems, volume 20, pp. 129-136. Citeseer, 2007.

Chang, C. and Lin, C. LIBSVM: a library for sup- 
port vector machines, 2001. Software available at 
http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Daumé III, H. Frustratingly Easy Domain Adaptation.
In Proceedings of the 45th Annual Meeting of
the Association for Computational Linguistics ACL, 
volume 1, pp. 256-263. Association for Computa- 
tional Linguistics, 2007. 

Gower, J.C. and Dijksterhuis, G.B. Procrustes Problems.
Oxford University Press, USA, 2004.

Ham, J., Lee, D., and Saul, L. Semisupervised alignment
of manifolds. In Proceedings of the Annual
Conference on Uncertainty in Artificial Intelligence, 
Z. Ghahramani and R. Cowell, Eds, volume 10, pp. 
120-127. Citeseer, 2005. 

Hou, C., Zhang, C., Wu, Y., and Nie, F. Multiple view
semi-supervised dimensionality reduction. Pattern 
Recognition, 43(3):720-730, 2010. ISSN 0031-3203. 

Huang, J., Smola, A.J., Gretton, A., Borgwardt, K.M.,
and Schölkopf, B. Correcting sample selection bias
by unlabeled data. In Advances in Neural Information
Processing Systems, volume 19, pp. 601. Citeseer,
2007.

Long, B., Yu, P.S., and Zhang, ZM. A general model 
for multiple view unsupervised learning. In Pro- 
ceedings of the 8th SIAM International Conference 
on Data Mining (SDM 08), Atlanta, Georgia, USA, 
2008. 

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain
adaptation with multiple sources. In Advances
in Neural Information Processing Systems,
volume 21, pp. 1041-1048. Citeseer, 2009.

Rudelson, M. and Vershynin, R. Sampling from large
matrices: An approach through geometric functional
analysis. Journal of the ACM (JACM), 54(4):21, 2007.

Rüping, S. and Scheffer, T. Learning with multiple
views. In Proceedings of the International Conference
on Machine Learning Workshop on Learning with 
Multiple Views, 2005. 

Satpal, S. and Sarawagi, S. Domain adaptation of con- 
ditional probability models via feature subsetting. 
In Proceedings of Principles of Data Mining and 
Knowledge Discovery, pp. 224-235. Springer, 2007. 

Shapiro, A., Dentcheva, D., and Ruszczyński, A. Lectures
on stochastic programming: modeling and theory. Society for Industrial
Mathematics, 2009. ISBN 089871687X. 

Shimodaira, H. Improving predictive inference under
covariate shift by weighting the log-likelihood func- 
tion. Journal of Statistical Planning and Inference, 
90:227-244, 2000. 

Stewart, G.W. and Sun, J.G. Matrix Perturbation Theory.
Academic Press, 1990.

Subramanya, A., Raj, A., Bilmes, J., and Fox, D. Rec- 
ognizing activities and spatial context using wear- 
able sensors. In Proceedings of the Conference on 
Uncertainty in Artificial Intelligence. Citeseer, 2006. 

Wang, C. and Mahadevan, S. Manifold alignment us- 
ing Procrustes analysis. In Proceedings of the 25th 
International Conference on Machine Learning, pp. 
1120-1127. ACM, 2008. 

Wang, C. and Mahadevan, S. Manifold alignment 
without correspondence. In Proceedings of the 21st 
International Joint Conferences on Artificial Intel- 
ligence, 2009. 



Learning from Multiple Outlooks 



A. Appendix 

A.1. Proof of Theorem 1

The next theorem presents the robust counterpart of Problem (6), that is, the robust version of the optimization for the two-outlook rotation problem (each component in Problem 2). We restate the theorem for clarity:

Theorem 1. Problem (6) is equivalent to

$$\min_{R} \left\| R D^{(2)} - D^{(1)} \right\|_F .$$
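As a numerical sanity check of Theorem 1, the following minimal sketch verifies the closed form that underlies the equivalence: for an orthogonal $R$, the worst case of $\|R(D^{(2)} + \Delta) - D^{(1)}\|_F$ over $\|\Delta\|_F \le \rho$ equals $\|RD^{(2)} - D^{(1)}\|_F + \rho$, so the robust and the nominal problems share their minimizers. The matrix sizes, the random data, the value of $\rho$ and the helper name `worst_case_residual` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def worst_case_residual(R, D1, D2, rho):
    """Closed-form and attained worst case of ||R (D2 + Delta) - D1||_F
    over ||Delta||_F <= rho, for an orthogonal R (hypothetical helper)."""
    V = R @ D2 - D1                            # nominal residual
    nominal = np.linalg.norm(V, "fro")
    closed_form = nominal + rho                # value claimed by the proof
    delta_star = rho * R.T @ V / nominal       # maximizing perturbation
    attained = np.linalg.norm(R @ (D2 + delta_star) - D1, "fro")
    return closed_form, attained

rng = np.random.default_rng(0)
d, h, rho = 5, 3, 0.3                          # assumed sizes and radius
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
D1 = rng.standard_normal((d, h))
D2 = rng.standard_normal((d, h))

closed_form, attained = worst_case_residual(R, D1, D2, rho)

# Random feasible perturbations should never exceed the closed form.
sampled = []
for _ in range(1000):
    delta = rng.standard_normal((d, h))
    delta *= rho * rng.uniform() / np.linalg.norm(delta, "fro")
    sampled.append(np.linalg.norm(R @ (D2 + delta) - D1, "fro"))
largest_sampled = max(sampled)
```

Since the added term $\rho$ does not depend on $R$, minimizing the worst case over $R$ is the same as minimizing the nominal residual.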



Proof. We obtain an explicit expression for the maximization in (6). By definition, the norm may be written as

$$\max_{\|\Delta\|_F \le \rho} \left\| R(D^{(2)} + \Delta) - D^{(1)} \right\|_F
= \max_{\|\Delta\|_F \le \rho,\ \|V\|_F \le 1} \operatorname{tr}\left( V^T \left( R(D^{(2)} + \Delta) - D^{(1)} \right) \right)
= \max_{\|V\|_F \le 1} \left\{ \operatorname{tr}\left( V^T (RD^{(2)} - D^{(1)}) \right) + \max_{\|\Delta\|_F \le \rho} \operatorname{tr}\left( V^T R \Delta \right) \right\}. \quad (7)$$

Next, we develop an explicit representation of the inner maximization over $\Delta$. By applying the Cauchy-Schwarz inequality and the unitary invariance of the Frobenius norm we obtain an upper bound:

$$\max_{\|\Delta\|_F \le \rho} \operatorname{tr}\left( V^T R \Delta \right) \le \max_{\|\Delta\|_F \le \rho} \|V\|_F \|R\Delta\|_F = \rho \|V\|_F .$$

Let $\Delta^* = \rho R^T V / \|V\|_F$. Observe that

$$\max_{\|\Delta\|_F \le \rho} \operatorname{tr}\left( V^T R \Delta \right) \ge \operatorname{tr}\left( V^T R \Delta^* \right) = \rho \|V\|_F .$$

We conclude that $\max_{\|\Delta\|_F \le \rho} \operatorname{tr}(V^T R \Delta) = \rho \|V\|_F$. Inserting this equation into (7) we obtain:

$$\max_{\|\Delta\|_F \le \rho} \left\| R(D^{(2)} + \Delta) - D^{(1)} \right\|_F
= \max_{\|V\|_F \le 1} \left\{ \operatorname{tr}\left( V^T (RD^{(2)} - D^{(1)}) \right) + \rho \|V\|_F \right\}
= \left\| RD^{(2)} - D^{(1)} \right\|_F + \rho,$$

which concludes the proof. □

A.2. Proof of Theorem 2

We restate the theorem for clarity:

Theorem 2. (Sample complexity of rotation for two outlooks) Suppose that Assumption 1 holds. Then, for $\delta, \epsilon_i, \epsilon \in (0, 1)$, if the number of samples for each class and outlook $i$ satisfies

$$n_i \ge C \frac{d h^2}{[\ldots]} \log^2\left( \frac{32 d h^2}{[\ldots]} \right) \log^2\left( [\ldots] \right),$$

then

$$P\left( \|\hat{R} - R\| \le \epsilon \right) \ge 1 - \delta,$$

where $\hat{R}$ is the estimated rotation matrix found by Algorithm 2, $d$ is the dimension and $C$ is a constant.

Before providing the proof we present the following lemmas:

Lemma 3 (Sample complexity of estimating mean). Let Assumption 1 hold. Then for $\delta, \epsilon \in (0, 1)$, if the number of samples of each class and outlook satisfies $n \ge \frac{2d}{\epsilon^2} \log\left( \frac{2d}{\delta} \right)$, then

$$P\left( \|\hat{\mu} - \mu\| \le \epsilon \right) \ge 1 - \delta,$$

where $\hat{\mu}$ and $\mu$ are the empirical and true mean of each component of the mixture.

Proof. We use $\sigma^2_{\max} = \max_k (\sigma_k^2)$ as the maximal directional variance of the $j$th mixture component, and $\sigma_k$ as the standard deviation of the samples' $k$th coordinate. By applying Chernoff's method on each coordinate of $|\hat{\mu}_k - \mu_k|$, $k = 1, \ldots, d$, and then applying the union bound, we obtain that for $n \ge \frac{2 d \sigma^2_{\max}}{\epsilon^2} \log\left( \frac{2d}{\delta} \right)$, $\|\hat{\mu} - \mu\|_2 \le \epsilon$ holds with probability of at least $1 - \delta$. The bound is obtained by applying $\sigma^2_{\max} \le 1$, which is implied by Assumption 1. □

Lemma 4. Let $X$ be a set of $n$ points drawn from a one dimensional Gaussian with mean $\mu$ and variance $\sigma^2$. With probability $1 - \delta$,

$$|x - \mu| \le \sigma \sqrt{2 \log\left( \frac{n}{\delta} \right)} \quad \forall x \in X.$$

Lemma 5. Let $x_1, \ldots, x_n$ be a set of independent realizations of random vectors from a multivariate normal distribution in $\mathbb{R}^d$. Then with probability of at least $1 - \delta$,

$$\left| \|x_i\| - \|\mu\| \right| \le \sigma \sqrt{2 d \log\left( \frac{nd}{\delta} \right)}.$$
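The bounds of Lemmas 4 and 5 can be checked by simulation. The sketch below is only an illustration under stated assumptions: the logarithm arguments are taken to be $n/\delta$ and $nd/\delta$, the multivariate normal of Lemma 5 is taken to be isotropic with standard deviation $\sigma$, and the function names, sample sizes and seeds are hypothetical.

```python
import numpy as np

def lemma4_failure_rate(n, delta, mu=0.0, sigma=1.0, trials=2000, seed=0):
    """Fraction of trials in which some sample violates
    |x - mu| <= sigma * sqrt(2 log(n / delta))."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=(trials, n))
    bound = sigma * np.sqrt(2.0 * np.log(n / delta))
    return float(np.mean(np.abs(x - mu).max(axis=1) > bound))

def lemma5_failure_rate(n, d, delta, sigma=1.0, trials=500, seed=0):
    """Fraction of trials in which some sample violates
    | ||x_i|| - ||mu|| | <= sigma * sqrt(2 d log(n d / delta)),
    for an isotropic Gaussian N(mu, sigma^2 I) (an assumption)."""
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(d)
    x = mu + sigma * rng.standard_normal((trials, n, d))
    bound = sigma * np.sqrt(2.0 * d * np.log(n * d / delta))
    gap = np.abs(np.linalg.norm(x, axis=2) - np.linalg.norm(mu))
    return float(np.mean(gap.max(axis=1) > bound))

rate4 = lemma4_failure_rate(n=50, delta=0.1)
rate5 = lemma5_failure_rate(n=50, d=5, delta=0.1)
```

Both empirical failure rates should stay below the chosen $\delta$. The norm bound of Lemma 5 is typically far from tight, because it is derived from a coordinate-wise union bound.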



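Proof. By the reverse triangle inequality we have that

$$\|x_i\| - \|\mu\| \le \left| \|x_i\| - \|\mu\| \right| \le \|x_i - \mu\| .$$

By applying Lemma 4 on a single coordinate $k$ of the random vectors $x_i$, with $t = \sigma \sqrt{2 \log(nd/\delta)}$, we get

$$P\left( \max_i \left| x_i^{(k)} - \mu^{(k)} \right| \ge t \right) \le n \exp\left( -\frac{t^2}{2\sigma^2} \right) = \frac{\delta}{d} .$$

Taking the union bound over the $d$ coordinates we get that with probability at least $1 - \delta$

$$\left| \|x_i\| - \|\mu\| \right| \le \|x_i - \mu\| \le \sigma \sqrt{2 d \log\left( \frac{nd}{\delta} \right)} .$$

□

Lemma 6 (Sample complexity of covariance estimation). Let $X$ be a set of random samples generated from a Gaussian distribution with covariance $\Sigma$ and zero mean $\mu = 0$. Define $\hat{\Sigma}, \hat{\mu}$ as the estimated covariance matrix and mean of the sample. Then for $\delta, \epsilon_1, \epsilon_2 \in (0, 1)$, for a sample size of

$$n \ge \frac{d}{[\ldots]} \log^2\left( \frac{2d}{[\ldots]} \right) \log^2\left( \frac{2d}{[\ldots]} \right)$$

we have that

$$P\left( \|\hat{\Sigma} - \Sigma\| \le \epsilon_1 + \epsilon_2 \right) \ge 1 - \delta .$$

Proof. The concentration bound is obtained by dividing the error into two components,

$$\|\hat{\Sigma} - \Sigma\| \le \left\| \mu\mu^T - \hat{\mu}\hat{\mu}^T \right\| + \left\| \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T - E\left[ x x^T \right] \right\| . \quad (8)$$

We begin by bounding the first component. Recall that $\mu = 0$, so the first component is bounded by $\|\hat{\mu}\|^2$. We apply Lemma 3 and obtain that with probability at least $1 - \frac{\delta}{2}$:

$$n_1 \ge \frac{2d}{\epsilon_1} \log\left( \frac{2d}{\delta/2} \right) \quad \Longrightarrow \quad \|\hat{\mu}\|^2 \le \epsilon_1 . \quad (9)$$

The second component is bounded by a concentration inequality for covariance matrices presented by Rudelson & Vershynin (2007). For completeness we add the relevant theorem; see Theorem 7. The second moment condition holds by Assumption 1. The second condition, of bounded sample norm, is obtained as follows. By applying Lemma 5 and bounding the variance according to Assumption 1, we get that $\|x\| \le \sqrt{2 d \log\left( \frac{nd}{\delta} \right)}$.

Next, we apply Theorem 3.1 of (Rudelson & Vershynin, 2007) with $t^2 = a^2 \log\left( \frac{4}{\delta} \right) / c$ and $a^2 = \epsilon_2^2\, c / \log\left( \frac{4}{\delta} \right)$. This results in the condition

$$a^2 = \frac{\epsilon_2^2\, c}{\log\left( \frac{4}{\delta} \right)} \ge C \frac{2 d \log\left( \frac{nd}{\delta} \right) \log(n)}{n},$$

which is satisfied for the choice of

$$n_2 \ge [\ldots] \log\left( [\ldots] \right) \log^2\left( [\ldots] \right) . \quad (10)$$

We get the final sample bound by taking the maximum between the sample complexity of the mean (9) and the covariance estimation (10). □

Proof of Theorem 2. Observe that by applying Equation (1) to each class and outlook we have that each component has zero mean. By Lemma 3, the sample complexity of this step is $n_i \ge \frac{2d}{\epsilon^2} \log\left( \frac{2d}{\delta} \right)$ (for each class and outlook $i$). In the following derivations we assume zero mean of the components' distribution. We show that the sample complexity of both stages is dominated by the rotation.

By substituting the finite and infinite sample rotation matrices with the values defined in Alg. 2 and applying the triangle inequality twice we have that

$$\|\hat{R} - R\| = \left\| \hat{V}\hat{U}^T - VU^T \right\| \le \|V\| \|\Delta U\| + \|\Delta V\| \|\Delta U\| + \|\Delta V\| \|U\|, \quad (11)$$

where $\Delta V = \hat{V} - V$ and $\Delta U = \hat{U} - U$. Recall that the matrices $U, \hat{U}, V, \hat{V}$ are the matrices of singular vectors resulting from the SVD decompositions $D^{(2)} D^{(1)T} = U S V^T$ and $\hat{D}^{(2)} \hat{D}^{(1)T} = \hat{U} \hat{S} \hat{V}^T$. We apply the perturbation theory of the SVD decomposition presented in (Stewart & Sun, 1990) and bound Eq. (11) by

$$\|\hat{R} - R\| \le C \left\| D^{(2)} D^{(1)T} - \hat{D}^{(2)} \hat{D}^{(1)T} \right\|, \quad (12)$$

where $C$ is a constant. Observe that

$$\left\| D^{(2)} D^{(1)T} - \hat{D}^{(2)} \hat{D}^{(1)T} \right\|_F
\le \left\| D^{(2)} D^{(1)T} - D^{(2)} \hat{D}^{(1)T} \right\|_F + \left\| D^{(2)} \hat{D}^{(1)T} - \hat{D}^{(2)} \hat{D}^{(1)T} \right\|_F
\le \sqrt{h} \left\| D^{(1)T} - \hat{D}^{(1)T} \right\|_F + \sqrt{h} \left\| D^{(2)} - \hat{D}^{(2)} \right\|_F
= \sqrt{h} \left( \|\Delta D_1\|_F + \|\Delta D_2\|_F \right) . \quad (13)$$

The second inequality holds by the sub-multiplicative property of the Frobenius norm. When the number of columns $h < d$, columns of zeros need to be added to make the matrices square.

Define $v_l$ and $\hat{v}_l$ ($l = 1, \ldots, h$) to be the $h$ eigenvectors of matrices $D^{(i)}$ and $\hat{D}^{(i)}$ respectively. The following holds by definition: $\|\Delta D_i\|_F^2 = \sum_{l=1}^{h} \|\hat{v}_l - v_l\|^2$. Define the perturbation of the covariance matrix of mixture $i$ by $E_i = \hat{\Sigma}_i - \Sigma_i$. By applying the perturbation theory of the eigendecomposition on the perturbed covariance matrices (Stewart & Sun, 1990) (p. 240) we get that

$$\left\| \hat{v}_l^{(i)} - v_l^{(i)} \right\| \le C \|E_i\| .$$

Last, we use Lemma 6 to bound $E_i$ for each outlook ($i = 1, 2$). If the number of samples for each outlook is

$$n_i \ge C \frac{d h^2}{[\ldots]} \log^2\left( \frac{32 d h^2}{[\ldots]} \right) \log^2\left( [\ldots] \right)$$

then

$$P\left( \left\| \hat{\Sigma}_i - \Sigma_i \right\| \le \frac{\epsilon_{i,1} + \epsilon_{i,2}}{4h} \right) \ge 1 - \frac{\delta}{2},$$

which implies

$$P\left( \|\Delta D_i\|_F \le \frac{\epsilon_{i,1} + \epsilon_{i,2}}{4\sqrt{h}} \right) \ge 1 - \frac{\delta}{2} .$$

Plugging in the bound to (13) we get the final bound:

$$P\left( \left\| D^{(2)} D^{(1)T} - \hat{D}^{(2)} \hat{D}^{(1)T} \right\|_F \le \epsilon \right) \ge 1 - \delta,$$

for some $\epsilon = \frac{1}{4} \sum_{i=1}^{2} \left( \epsilon_{i,1} + \epsilon_{i,2} \right) \in (0, 1)$.

□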
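The rotation analysed in this proof has the form $R = VU^T$, where $USV^T$ is the SVD of $D^{(2)} D^{(1)T}$. The following minimal sketch, on synthetic data, recovers a planted rotation with this formula and shows that perturbing the inputs changes the estimate only slightly, which is the behaviour that bound (12) quantifies. It is an illustration under assumptions rather than the paper's Algorithm 2: the function names, the noise scales and the choice of more columns than dimensions (so that the rotation is uniquely determined) are all hypothetical.

```python
import numpy as np

def procrustes_rotation(D1, D2):
    """Orthogonal R minimizing ||R @ D2 - D1||_F, via the SVD of D2 @ D1.T."""
    U, _, Vt = np.linalg.svd(D2 @ D1.T)
    return Vt.T @ U.T                      # R = V U^T

def rotation_error(scale, D1, D2, R_ref, E1, E2):
    """Frobenius distance between R_ref and the rotation estimated
    from inputs perturbed by scale * E1 and scale * E2."""
    R_noisy = procrustes_rotation(D1 + scale * E1, D2 + scale * E2)
    return float(np.linalg.norm(R_noisy - R_ref))

rng = np.random.default_rng(1)
d, h = 3, 50                               # assumed sizes (h > d for uniqueness)
R_true, _ = np.linalg.qr(rng.standard_normal((d, d)))  # planted rotation
D2 = rng.standard_normal((d, h))
D1 = R_true @ D2                           # outlook 1 is a rotated outlook 2

R_hat = procrustes_rotation(D1, D2)        # noise-free estimate

E1, E2 = rng.standard_normal((2, d, h))    # fixed perturbation directions
err_small = rotation_error(1e-3, D1, D2, R_hat, E1, E2)
err_large = rotation_error(1e-1, D1, D2, R_hat, E1, E2)
```

In the noise-free case the estimate coincides with the planted rotation, and the error grows with the size of the perturbation of the inputs.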



Theorem 7 (Theorem 3.1 from (Rudelson & Vershynin, 2007)). Let $x$ be a random vector in $\mathbb{R}^d$ from distribution $D$, which is uniformly bounded almost everywhere: $\|x\| \le M$, and $\left\| E\left[ x x^T \right] \right\| \le 1$. Let $x_1, \ldots, x_n$ be independent samples generated from $D$. Define

$$a = C M \sqrt{\frac{\log n}{n}},$$

where $C$ is an absolute constant. Then, for every $t \in (0, 1)$,

$$P\left( \left\| \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T - E\left[ x x^T \right] \right\| > t \right) \le 2 e^{-c t^2 / a^2} .$$

A.3. Domain Adaptation setup - Results

Following are results obtained on all users for the domain adaptation experiment. Table 1 presents the results for two users obtained on 5% labeled target data. Table 2 presents the results for multi-source domain adaptation with three users, each with 15% labeled data. Both tables contain the balanced error rate (BER) on the five class classification task. Highlighted results represent significance of the result with p-value < 0.05.

Table 1. Domain Adaptation setup for two users (5% labeled Target).

S -> T    MOMAP   FEDA    TRG     SRC     ALL
2 -> 1    0.208   0.280   0.249   0.255   0.234
3 -> 1    0.228   0.292   0.269   0.209   0.2
4 -> 1    0.221   0.293   0.256   0.246   0.233
5 -> 1    0.21    0.304   0.27    0.23    0.216
6 -> 1    0.255   0.294   0.265   0.345   0.283
1 -> 2    0.20    0.29    0.26    0.21    0.20
3 -> 2    0.212   0.281   0.253   0.215   0.205
4 -> 2    0.186   0.287   0.252   0.216   0.209
5 -> 2    0.191   0.281   0.249   0.223   0.208
6 -> 2    0.203   0.27    0.244   0.352   0.271
1 -> 3    0.216   0.281   0.26    0.23    0.224
2 -> 3    0.214   0.271   0.256   0.265   0.241
4 -> 3    0.215   0.276   0.252   0.233   0.222
5 -> 3    0.213   0.298   0.264   0.278   0.237
6 -> 3    0.210   0.276   0.251   0.359   0.282
1 -> 4    0.233   0.277   0.256   0.309   0.253
2 -> 4    0.231   0.269   0.264   0.314   0.265
3 -> 4    0.245   0.281   0.27    0.276   0.249
5 -> 4    0.235   0.289   0.27    0.313   0.246
6 -> 4    0.243   0.267   0.262   0.422   0.293
1 -> 5    0.228   0.307   0.272   0.244   0.237
2 -> 5    0.237   0.29    0.275   0.289   0.267
3 -> 5    0.233   0.289   0.261   0.239   0.228
4 -> 5    0.22    0.286   0.258   0.258   0.243
6 -> 5    0.221   0.269   0.247   0.3     0.259
1 -> 6    0.234   0.376   0.321   0.294   0.273
2 -> 6    0.238   0.37    0.316   0.305   0.273
3 -> 6    0.254   0.386   0.344   0.261   0.247
4 -> 6    0.235   0.374   0.326   0.294   0.263
5 -> 6    0.244   0.379   0.325   0.246   0.239
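Tables 1 and 2 report the balanced error rate (BER). The formula is not spelled out in this appendix; the sketch below assumes the usual definition, the mean over classes of the per-class error rate, which weights every class equally regardless of how many samples it has. The function name and the toy labels are hypothetical.

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Mean over classes of the per-class misclassification rate
    (assumed definition of BER)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class_error = [
        np.mean(y_pred[y_true == c] != c) for c in np.unique(y_true)
    ]
    return float(np.mean(per_class_error))

# Toy, imbalanced example: 8 "walk" samples, 2 "run" samples.
y_true = ["walk"] * 8 + ["run"] * 2
y_pred = ["walk"] * 8 + ["walk", "run"]   # one "run" sample is misclassified

ber = balanced_error_rate(y_true, y_pred)
plain_error = float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```

In this toy case the plain error rate is 0.1 while the BER is 0.25, because the single mistake falls in the small class. This is why BER is the more informative measure when activity classes are unevenly represented.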
Table 2. Domain Adaptation setup for multiple users (15% labeled Target).

Users    MOMAP   FEDA    TRG     ALL
1 2 3    0.205   0.232   0.227   0.214
1 2 4    0.203   0.235   0.224   0.214
1 2 5    0.203   0.236   0.22    0.213
1 2 6    0.211   0.253   0.238   0.226
1 3 4    0.207   0.233   0.224   0.22
1 3 5    0.208   0.24    0.226   0.21
1 3 6    0.221   0.255   0.239   0.228
1 4 5    0.208   0.237   0.223   0.219
1 4 6    0.214   0.252   0.236   0.232
1 5 6    0.222   0.257   0.239   0.228
2 3 4    0.214   0.234   0.229   0.216
2 3 5    0.21    0.235   0.228   0.215
2 3 6    0.218   0.243   0.236   0.225
2 4 5    0.204   0.233   0.221   0.212
2 4 6    0.216   0.254   0.239   0.232
2 5 6    0.226   0.257   0.243   0.226
3 4 5    0.219   0.239   0.231   0.222
3 4 6    0.224   0.258   0.244   0.235
3 5 6    0.227   0.254   0.239   0.225
4 5 6    0.222   0.252   0.242   0.232

A.4. Multiple outlook setup - Experiment 3

In the third experiment we constructed the feature representation of each outlook from the 33 accelerometer Fourier coefficients, to which we added 10 features of random Gaussian noise $\mathcal{N}(0, 1)$. We then randomly permuted the order of the features of each outlook. For this experiment, we used samples belonging to the walking, running and lingering classes, as we did not use the full feature set. The experiment was performed for the two outlook scenario as well as for multiple outlooks.

Figure 4 shows the results for 5% labeled target data for different user pairs. It can be observed that for the walking and lingering activities the mapped outlook performs similarly to the TRG classifier. In all cases, the mapped outlook classifies the running activity with the fewest errors. Among all user pairs the MOMAP classifier obtained a smaller error for the running activity (3.5%-45% smaller for 5% labeled data). The results show the boosting power of the mapping, which, as may be expected, is most powerful for the classes with less labeled data. An interesting behavior is that even when all labeled data is available the MOMAP algorithm sometimes outperforms the classifier learned in the target outlook (OPT). This may be caused by some regularization obtained by the mapping. Note, however, that for the total error, on all three classes, the MOMAP classifier does not outperform the OPT classifier. The results for multiple outlooks are presented in Figure 6. It can be observed that the mapping aids in learning the mixture.

[Figure 5 panels: (a) User 2 -> User 1; (b) User 1 -> User 4; (c) User 4 -> User 5; (d) User 5 -> User 6. Horizontal axis in each panel: walking, running, lingering.]

Figure 5. Multiple outlook setup for two outlooks with added noise features and randomly permuted features.

[Figure 6 panels: (a) Users 1, 4 and 6; (b) Users 2, 3 and 5. Legend: MOMAP, TGT.]

Figure 6. Multiple outlooks learning for a mixture of m = 3 outlooks. Noise features are added to each outlook and then the features are randomly permuted.
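The feature construction of Experiment 3 (Section A.4) can be sketched as follows. This is a minimal illustration under assumptions: the 33 Fourier coefficients are replaced by a random stand-in array, a single permutation is assumed to be shared by all samples of an outlook, and the function name, sample count and seeds are hypothetical.

```python
import numpy as np

def make_noisy_permuted_outlook(features, n_noise=10, seed=0):
    """Append n_noise columns of N(0, 1) noise to the feature matrix and
    apply one random column permutation shared by all samples."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((features.shape[0], n_noise))
    augmented = np.hstack([features, noise])
    perm = rng.permutation(augmented.shape[1])
    return augmented[:, perm], perm

# Stand-in for the 33 Fourier coefficients of 100 samples (assumed data).
X = np.random.default_rng(42).standard_normal((100, 33))

outlook_a, perm_a = make_noisy_permuted_outlook(X, seed=1)
outlook_b, perm_b = make_noisy_permuted_outlook(X, seed=2)

# Undoing the permutation recovers the original features in the first 33 columns.
recovered = outlook_a[:, np.argsort(perm_a)][:, :33]
```

Each call yields a 43-dimensional outlook whose columns carry no information about the original feature order, so a mapping between two such outlooks has to be learned from the data distributions alone.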



